
[Models] Add SharedFusedMoE support to Qwen3MoE#32082

Merged
ywang96 merged 11 commits into vllm-project:main from Isotr0py:qwen3moe-share-expert
Jan 24, 2026

Conversation

@Isotr0py
Member

@Isotr0py Isotr0py commented Jan 10, 2026

Purpose

  • Fix [Perf] Use vLLM's SharedFusedMoE in Qwen3-Omni vllm-omni#560 (comment)
  • Qwen3-Omni's MoE talker has shared experts in its sparse MoE block, while vLLM's Qwen3MoE implementation assumes there are no shared experts inside the sparse MoE block (a rough sketch of this structure follows the list). As a result, vLLM-omni had to carry a duplicate implementation of the Qwen3MoE sparse MoE block.
  • This PR upstreams that implementation to vLLM to avoid the duplication.
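
For readers unfamiliar with the layout, below is a minimal, self-contained PyTorch sketch of a sparse MoE block that also carries a shared expert. All class and attribute names here (`TinyExpert`, `SparseMoEWithSharedExpert`, `shared_expert_gate`, ...) are illustrative assumptions; this is not the vLLM or vllm-omni code, only the dataflow it supports.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyExpert(nn.Module):
    """A stand-in MLP expert: down(silu(up(x)))."""

    def __init__(self, hidden: int, intermediate: int):
        super().__init__()
        self.up = nn.Linear(hidden, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.up(x)))


class SparseMoEWithSharedExpert(nn.Module):
    """Routed top-k experts plus one shared expert applied to every token."""

    def __init__(self, hidden: int, intermediate: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, num_experts, bias=False)       # router logits
        self.experts = nn.ModuleList(TinyExpert(hidden, intermediate) for _ in range(num_experts))
        self.shared_expert = TinyExpert(hidden, intermediate)        # runs on all tokens
        self.shared_expert_gate = nn.Linear(hidden, 1, bias=False)   # sigmoid-gated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, hidden)
        weights, topk_ids = torch.topk(F.softmax(self.gate(x), dim=-1), self.top_k, dim=-1)
        routed_rows = []
        for ids, w, token_x in zip(topk_ids, weights, x):
            routed_rows.append(sum(weight * self.experts[int(expert_id)](token_x)
                                   for expert_id, weight in zip(ids, w)))
        routed = torch.stack(routed_rows)
        shared = torch.sigmoid(self.shared_expert_gate(x)) * self.shared_expert(x)
        return routed + shared  # shared-expert output is summed into the routed output


if __name__ == "__main__":
    block = SparseMoEWithSharedExpert(hidden=16, intermediate=32, num_experts=4, top_k=2)
    print(block(torch.randn(3, 16)).shape)  # torch.Size([3, 16])
```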

Test Plan

vLLM side, with:

python examples/offline_inference/vision_language.py -m qwen3_vl

vllm-omni side with https://github.com/Isotr0py/vllm-omni/tree/check-qwen3-omni-moe-talker:

python examples/offline_inference/qwen3_omni/end2end.py

Test Result

Tested with Qwen/Qwen3-VL-30B-A3B-Instruct-FP8 at tp_size=2:

INFO 01-12 00:14:59 [utils.py:254] non-default args: {'max_model_len': 4096, 'tensor_parallel_size': 2, 'max_num_seqs': 5, 'limit_mm_per_prompt': {'image': 1, 'video': 0, 'audio': 0}, 'mm_processor_kwargs': {'min_pixels': 784, 'max_pixels': 1003520, 'fps': 1}, 'model': '/home/mozf/LLM/Qwen3-VL-30B-A3B-Instruct-FP8'}
INFO 01-12 00:14:59 [model.py:528] Resolved architecture: Qwen3VLMoeForConditionalGeneration
...
(EngineCore_DP0 pid=3767062) INFO 01-12 00:17:25 [kv_cache_utils.py:1305] GPU KV cache size: 98,176 tokens
(EngineCore_DP0 pid=3767062) INFO 01-12 00:17:25 [kv_cache_utils.py:1310] Maximum concurrency for 4,096 tokens per request: 23.97x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  8.36it/s]
Capturing CUDA graphs (decode, FULL): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  8.06it/s]
(EngineCore_DP0 pid=3767062) (Worker_TP0 pid=3767068) INFO 01-12 00:17:27 [gpu_model_runner.py:4837] Graph capturing finished in 2 secs, took 0.19 GiB
(EngineCore_DP0 pid=3767062) INFO 01-12 00:17:28 [core.py:273] init engine (profile, create kv cache, warmup model) took 112.92 seconds
(EngineCore_DP0 pid=3767062) INFO 01-12 00:17:31 [core.py:186] Batch queue is enabled with size 2
INFO 01-12 00:17:31 [llm.py:344] Supported tasks: ['generate']
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00,  1.17it/s]
Processed prompts: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:02<00:00,  1.66it/s, est. speed input: 1621.29 toks/s, output: 105.99 toks/s]
--------------------------------------------------
This image captures a beautiful spring scene in Japan, featuring the Tokyo Skytree framed by blooming cherry blossoms. The photograph is taken from a low angle, looking up through the branches of a cherry blossom tree, with its pink flowers in full bloom creating a delicate, natural frame around the iconic tower. The clear blue
--------------------------------------------------
This image captures a beautiful spring scene in Japan, featuring the Tokyo Skytree framed by blooming cherry blossoms (sakura). The photo is taken from a low angle, looking up through the branches of a cherry blossom tree, with its pink flowers in the foreground creating a natural, delicate frame around the iconic tower
--------------------------------------------------
This image captures a beautiful spring scene in Japan, featuring the Tokyo Skytree framed by blooming cherry blossoms. The photograph is taken from a low angle, looking up through the branches of a cherry blossom tree, with its delicate pink flowers in the foreground. The iconic white tower of the Tokyo Skytree rises in the
--------------------------------------------------
This image captures a beautiful spring scene in Japan, featuring the Tokyo Skytree framed by blooming cherry blossoms. The photograph is taken from a low angle, looking up through the branches of a cherry blossom tree, which creates a natural frame around the iconic tower. The vibrant pink flowers contrast beautifully with the clear blue sky
--------------------------------------------------
  • vllm-omni generates reasonable audio outputs with this implementation.

Note

Introduces shared-expert support in Qwen3MoE via SharedFusedMoE and optional shared MLP gating.

  • Replace FusedMoE with SharedFusedMoE in sparse MoE block; add gate, optional shared_expert_gate, and shared_expert MLP (controlled by shared_expert_intermediate_size)
  • Update forward to compute shared_out and fused_out, sum when present, and perform TP all-reduce when not sequence-parallel; set reduce_results=False
  • Extend Qwen3MoeMLP to accept expert_gate and apply sigmoid gating to outputs
  • Use SharedFusedMoE.make_expert_params_mapping for expert weight loading

Written by Cursor Bugbot for commit f4ebf48.
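
To make the reduction strategy in this note concrete, here is a hedged sketch of summing the shared and routed outputs and deferring to a single all-reduce. It uses plain torch.distributed rather than vLLM's tensor-parallel helpers, and `combine_moe_outputs` and its arguments are hypothetical names, not the actual diff.

```python
import torch
import torch.distributed as dist


def combine_moe_outputs(
    shared_out: torch.Tensor | None,
    fused_out: torch.Tensor,
    tp_size: int,
    tp_group: "dist.ProcessGroup | None" = None,
) -> torch.Tensor:
    """Sum the (partial) shared-expert and routed-expert outputs, then reduce once."""
    out = fused_out if shared_out is None else fused_out + shared_out
    # With reduce_results=False each branch holds only a partial sum under tensor
    # parallelism, so one all-reduce over the combined output suffices (skip for TP=1).
    if tp_size > 1:
        dist.all_reduce(out, group=tp_group)
    return out


if __name__ == "__main__":
    shared = torch.randn(4, 8)
    fused = torch.randn(4, 8)
    print(combine_moe_outputs(shared, fused, tp_size=1).shape)  # torch.Size([4, 8])
```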

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify mergify Bot added the qwen (Related to Qwen models) label Jan 10, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for SharedFusedMoE to Qwen3MoE, which is a valuable enhancement for handling models with shared experts. The overall approach is sound, but I've identified two critical issues that would lead to runtime errors. One is an incorrect method call with an undefined argument, and the other is improper handling of the return value from SharedFusedMoE.forward. Please see my detailed comments for suggestions on how to fix these issues.

Comment thread vllm/model_executor/models/qwen3_moe.py
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py marked this pull request as ready for review January 11, 2026 16:28
@Isotr0py Isotr0py requested a review from sighingnow as a code owner January 11, 2026 16:28
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py
Member Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for shared experts in Qwen3MoE by integrating SharedFusedMoE. The changes correctly set up the shared expert and its gating mechanism. However, I've found a critical issue in the tensor parallelism logic where the final output is not correctly reduced across tensor parallel ranks, which will lead to incorrect model outputs when tp_size > 1. I've provided a fix for this issue.

Comment thread vllm/model_executor/models/qwen3_moe.py
@Isotr0py Isotr0py added the ready (ONLY add when PR is ready to merge/full CI is needed) label Jan 12, 2026
hidden_act: str,
quant_config: QuantizationConfig | None = None,
reduce_results: bool = True,
expert_gate: torch.nn.Linear | None = None,


Incorrect type hint causes wrong tensor indexing

Low Severity

The expert_gate parameter is typed as torch.nn.Linear | None but is actually a ReplicatedLinear. The code does self.expert_gate(x)[0] to extract the output from a tuple returned by ReplicatedLinear.forward. However, torch.nn.Linear.forward returns a tensor directly, not a tuple, so [0] would incorrectly index into the first dimension of the tensor instead of extracting the output. While the current code works because only ReplicatedLinear is passed, the type annotation is misleading and using torch.nn.Linear as documented would produce silently incorrect results.

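A tiny, self-contained illustration of the indexing difference flagged here. `ReplicatedLinearLike` is a hypothetical stand-in that mimics the tuple-returning forward described in the comment; it is not vLLM's ReplicatedLinear.

```python
import torch
import torch.nn as nn


class ReplicatedLinearLike(nn.Linear):
    def forward(self, x: torch.Tensor):
        # vLLM-style linear layers return (output, output_bias), not just the output.
        return super().forward(x), None


x = torch.randn(4, 8)

gate_tuple = ReplicatedLinearLike(8, 1, bias=False)
out_ok = gate_tuple(x)[0]       # [0] unpacks the tuple -> shape (4, 1), as intended

gate_plain = nn.Linear(8, 1, bias=False)
out_wrong = gate_plain(x)[0]    # [0] indexes the first row of the tensor -> shape (1,)

print(out_ok.shape, out_wrong.shape)  # torch.Size([4, 1]) torch.Size([1])
```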

Copy link
Copy Markdown
Contributor

@gcanlin gcanlin left a comment


LGTM, thanks! cc @ywang96.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@mergify
Contributor

mergify Bot commented Jan 22, 2026

Hi @Isotr0py, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@ywang96 ywang96 merged commit 8edaf38 into vllm-project:main Jan 24, 2026
53 checks passed
@Isotr0py Isotr0py deleted the qwen3moe-share-expert branch January 24, 2026 07:36
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: 陈建华 <1647430658@qq.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
wangxiyuan pushed a commit to vllm-project/vllm-ascend that referenced this pull request Jan 29, 2026
…#6335)

### What this PR does / why we need it?
PR vllm-project/vllm#32082 in vLLM makes Qwen3-MoE models also go through `SharedFusedMoE`, while the current implementation of our `AscendSharedFusedMoE` assumes `shared_experts` always exists. This PR adds checks to `multistream_overlap_shared_expert` and `multistream_overlap_gate` so that these features are only enabled when shared experts exist.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All CI passed

- vLLM version: v0.14.1
- vLLM main:
vllm-project/vllm@dc917cc

Signed-off-by: whx-sjtu <2952154980@qq.com>
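
As a rough sketch of the guard this commit message describes: only the flag names (`shared_experts`, `multistream_overlap_shared_expert`, `multistream_overlap_gate`) follow the message; the class and constructor below are hypothetical, not the actual AscendSharedFusedMoE code.

```python
class SharedFusedMoEGuardExample:
    def __init__(self, shared_experts, enable_overlap_shared_expert: bool, enable_overlap_gate: bool):
        self.shared_experts = shared_experts  # may be None for models without shared experts
        # Only enable the multistream-overlap features when a shared expert actually exists.
        has_shared = shared_experts is not None
        self.multistream_overlap_shared_expert = enable_overlap_shared_expert and has_shared
        self.multistream_overlap_gate = enable_overlap_gate and has_shared


if __name__ == "__main__":
    moe = SharedFusedMoEGuardExample(shared_experts=None,
                                     enable_overlap_shared_expert=True,
                                     enable_overlap_gate=True)
    print(moe.multistream_overlap_shared_expert, moe.multistream_overlap_gate)  # False False
```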
starmountain1997 pushed a commit to starmountain1997/vllm-ascend that referenced this pull request Jan 31, 2026
wangxiyuan added a commit to vllm-project/vllm-ascend that referenced this pull request Feb 2, 2026
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
vllm-project/vllm#31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
vllm-project/vllm#32082 by
#6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
vllm-project/vllm#25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
vllm-project/vllm@dc917cc


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Tflowers-0129 pushed a commit to Tflowers-0129/vllm-ascend that referenced this pull request Feb 3, 2026
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
jiangyunfan1 pushed a commit to jiangyunfan1/vllm-ascend that referenced this pull request Apr 9, 2026
jiangyunfan1 pushed a commit to jiangyunfan1/vllm-ascend that referenced this pull request Apr 9, 2026